A discipline of dynamic programming over sequence data
Authors: Robert Giegerich, Carsten Meyer, Peter Steffen
Abstract
Dynamic programming is a classical programming technique, applicable in a wide variety of domains such as stochastic systems analysis, operations research, combinatorics of discrete structures, flow problems, parsing of ambiguous languages, and biosequence analysis. Little methodology has hitherto been available to guide the design of such algorithms. The matrix recurrences that typically describe a dynamic programming algorithm are difficult to construct, error-prone to implement, and, in nontrivial applications, almost impossible to debug completely. This article introduces a discipline designed to alleviate this problem. We describe an algebraic style of dynamic programming over sequence data. We define its formal framework, based on a combination of grammars and algebras, and including a formalization of Bellman's Principle. We suggest a language for algorithm design on a convenient level of abstraction. We outline three ways of implementing this language, including an embedding in a lazy functional language. The workings of the new method are illustrated by a series of examples drawn from diverse areas of computer science.

1 Power and scope of dynamic programming

1.1 Dynamic programming: a method "by example"

Computer science knows a handful of programming methods that are useful across many domains of application. Such methods are, for example, Structural Recursion, Divide-and-Conquer, Greedy Algorithms and Genetic Algorithms. Dynamic Programming (DP) is another classical programming method, introduced even before the term Computer Science was firmly established. When applicable, DP often allows one to solve combinatorial optimization problems over a search space of exponential size in polynomial space and time. Bellman's "Principle of Optimality" [Bel57] belongs to the core knowledge of every computer science graduate.
Significant work has gone into formally characterizing this principle [Sni92,Mor82,Mit64], formulating DP in different programming paradigms [Moo99,Cur97] and studying its relation to other general programming methods such as greedy algorithms [BM93].

The scope of DP is enormous. Much of the early work was done in the area of physical state transition systems and operations research [BD62]. Other, simpler examples (more suited for computer science textbooks) are optimal matrix chain multiplication, polygon triangulation, or string comparison. The analysis of molecular sequence data has fostered increased interest in DP. Protein homology search, RNA structure prediction, gene finding, and interpretation of mass spectrometry data pose combinatorial optimization problems unprecedented in variety and data volume. A recent textbook in biosequence analysis [DEKM98] lists 11 applications of DP in its introductory chapter, and many more in the sequel.

Developing a DP algorithm for an optimization problem over a nontrivial domain has intrinsic difficulties. The choice of objective function and search space are interdependent, and inextricably linked to questions of efficiency. Once completed, all DP algorithms are expressed via recurrence relations between tables holding intermediate results. These recurrences provide a very low level of abstraction, and subscript errors are a major nuisance even in published articles. The recurrences are difficult to explain, painful to implement, and almost impossible to debug: a subtle error gives rise to a suboptimal solution every now and then, which is virtually undetectable by human inspection. In this situation it is remarkable that neither the literature cited above, nor many other computer science textbooks ([Gus97,Meh84,BB88,AHU83,Sed89], to name but a few) provide guidance in the development of DP algorithms.
It appears that giving some examples and an informal discussion of Bellman's Principle is all the methodology we can offer to our students and practitioners. Notable exceptions are the textbooks by Cormen et al. and Schöning [CLR90,Sch01], which recognize this deficiency and formulate guiding rules on how to approach a new DP problem. We shall critically review these rules in our conclusion section. This state of the art is nicely summarized in a quote from an (anonymous) referee commenting on an initial version of this work, who wrote: "The development of successful dynamic programming recurrences is a matter of experience, talent, and luck."

1.2 Basic ideas of Algebraic Dynamic Programming

Algebraic dynamic programming (ADP) is a new style of dynamic programming that gives rise to a systematic approach to the development of DP algorithms. It allows one to design, reflect upon, tune and even test DP algorithms on a more abstract level than the recurrences that used to be all that was available to deal with dynamic programming algorithms. Four steps based on mathematical concepts guide the algorithm design. Many tricks that have been invented by practitioners of DP can be expressed as general techniques in ADP. The common aspects of related algorithms can be cleanly separated from their differences. On the implementation side, ADP exactly reproduces the classical DP recurrences. In principle, nothing is lost in terms of efficiency.¹ All this together makes us feel that Dynamic Programming is more and more becoming a discipline, rather than a "matter of experience, talent and luck". How can this be achieved? Any DP algorithm evaluates a search space of candidate solutions under a scoring scheme and an objective function.

¹ Asymptotic efficiency is preserved, while the constant factors depend on the mode of implementation, discussed in Section 5.
The classical DP recurrences reflect the four aspects of search space construction, scoring, choice, and efficiency in an indiscriminate fashion. In any systematic approach, these concerns must be separated. The algebraic approach to be presented here proceeds as follows:

The search space of the problem at hand is described by a yield grammar, which is a tree grammar generating a string language. The ADP developer takes the view that for a given input sequence, "first" the search space is constructed, leading to an enumeration of all candidate solutions. This is a parsing problem, solved by a standard device called a tabulating yield parser. The developer can concentrate on the design of the grammar.

Evaluation and choice are captured by an evaluation algebra. It is important (and in contrast to traditional presentations of DP algorithms) that this algebra comprises all aspects relevant to the intended objective of optimization, but is independent of the description of the search space. The ADP developer takes the view that a "second" phase evaluates the candidates enumerated by the first phase, and makes choices according to the optimality criterion.

Of course, the interleaving of search space construction and evaluation is essential to prevent combinatorial explosion. This interleaving is contributed by the ADP method in a way transparent to the developer. By the separate description of search space and evaluation, ADP produces modular and therefore re-usable algorithm components. Often, related optimization problems over the same search space can be solved merely by a change of the algebra. More complex problems can be approached with a better chance of success, and there is no loss of efficiency compared to ad-hoc approaches. Avoiding the formulation of explicit recurrences is a major relief, an effect captured by early practitioners of ADP in the slogan "No subscripts, no errors!".
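The separation of search space and evaluation can be illustrated by a toy sketch (in Python, with names of our own choosing; this is not the paper's yield-grammar notation): candidates are enumerated once, and exchangeable evaluation algebras are applied to the same candidates.

```python
# Toy illustration of the ADP separation of concerns (illustrative names,
# not the paper's formal notation): the search space is described once,
# and exchangeable "evaluation algebras" score its candidates.

def bracketings(xs):
    """Search space: all ways to parenthesize a non-empty sequence,
    represented as nested pairs (candidate trees)."""
    if len(xs) == 1:
        return [xs[0]]
    return [(left, right)
            for k in range(1, len(xs))
            for left in bracketings(xs[:k])
            for right in bracketings(xs[k:])]

def evaluate(candidate, algebra):
    """Evaluate one candidate tree bottom-up under a given algebra."""
    if isinstance(candidate, tuple):
        left, right = candidate
        return algebra["pair"](evaluate(left, algebra),
                               evaluate(right, algebra))
    return algebra["leaf"](candidate)

# A deliberately non-associative scoring algebra, so that different
# bracketings really receive different scores:
sub_algebra = {"leaf": lambda x: x, "pair": lambda l, r: l - r}

space = bracketings([1, 2, 3])
scores = [evaluate(c, sub_algebra) for c in space]
print(len(space), max(scores), min(scores))   # 2 candidates, scores 2 and -4
```

In ADP proper, enumeration and evaluation are interleaved by the tabulating yield parser, so the exponential search space is never materialized; the sketch materializes it only for clarity.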
We hope that the application examples presented in this article will convince the reader that following the guidance of ADP in fact brings about a boost in programming productivity and program reliability.

The ADP approach has emerged recently in the context of biosequence analysis, where new dynamic programming problems arise almost daily. In spite of its origin in this application domain, ADP is relevant to dynamic programming over sequential data in general. "Sequential data" does not mean we only study string problems – a chain of matrices to be multiplied, for example, is sequential input data in our sense, as well as the peak profiles provided by mass spectrometry. An informal introduction to ADP, written for the needs of the bioinformatics community, has appeared in [Gie00a]. The present article gives a complete account of the foundations of the ADP method, and shows its application to several classical combinatorial optimization problems in computer science.

Like any methodological work, this article suffers from the dilemma that for the sake of exposition, the problems treated here have to be rather simple, such that the impression may arise that methodological guidance is not really required. The ADP method has been applied to several nontrivial problems in the field of biosequence analysis. An early application is a program for aligning recombinant DNA [GKW99], developed when the ADP theory was just about to emerge. Two recent applications are searching for sequence/structure motifs in DNA or RNA [MG02], and the problem of folding saturated RNA secondary structures, posed by Zuker and Sankoff in [ZS84] and solved in [EG01]. We shall give a short account of such "real world" applications.

1.3 Overview of this article

In Section 2 we shall review some new and some well known applications of dynamic programming over sequence data, in the form in which they are traditionally presented.
This provides a common basis for the subsequent discussion. By the choice of examples, we illustrate the scope of dynamic programming to a certain extent. In particular, we show that (single) sequence analysis and (pairwise) sequence comparison are essentially the same kind of problem when viewed on a more abstract level. The applications studied here will later be reformulated in the spirit and notation of ADP.

In Section 3 we introduce the formal basis of the ADP method: yield grammars and evaluation algebras. We shall argue that these two concepts precisely capture the essence of dynamic programming, at least when applied to sequence data. Furthermore, we introduce a special notation for expressing ADP algorithms. Using this notation an algorithm is completely described on a very abstract level, and can be designed and analyzed irrespective of how it is eventually implemented. We discuss efficiency analysis and point to other work concerning techniques to improve efficiency.

In Section 4 we formulate the ADP development method and develop yield grammars and evaluation algebras for the applications described in Section 2. Moreover we show how solutions to problem variants can be expressed transparently using the ADP approach.

Section 5 indicates three ways of actually implementing an algorithm once it is written in ADP notation: the first alternative is a direct embedding and execution in a functional programming language, the second is manual translation to the abstraction level of an imperative programming language. The third alternative, still under development, is the use of a system which directly compiles ADP notation into C code.

In the conclusion, we discuss the merits of the method presented here, evaluate its scope, and glance at its possible extensions.

This article may be read in several different ways. Readers familiar with standard examples of dynamic programming may jump right away to the theory in Section 3.
Readers mainly interested in methodology, willing to take for granted that ADP can be implemented without loss of efficiency, may completely skip Section 5.

2 Dynamic programming in traditional style

In this section we discuss four introductory examples of dynamic programming, solved by recurrences in the traditional style. Three will be reformulated in algebraic style in Section 4. We begin our series of examples with an algorithmic fable.

2.1 The oldest DP problem in the world

Our first example dates back to the time around 800. Al Chwarizmi, today known for his numerous important discoveries in elementary arithmetic and dubbed the father of algorithmics, was a scholar at the House of Wisdom in Baghdad. At that time, the patron of the House of Wisdom was El Mamun, Caliph of Baghdad and son of Harun al Raschid. It is told that one day the Caliph called for Al Chwarizmi for a special research project. He presented the formula 1 + 2 ∗ 3 ∗ 4 + 5, which had its origins in a bill for a couple of camels, as he remarked. Unfortunately, the formula was lacking the parentheses. The task was to find a general method to redraw the parentheses in the formula (and any similar one) such that the outcome was either minimized or maximized – depending on whether the Caliph was on the buying or on the selling side.

We now provide a DP solution for El Mamun's problem. Clearly, explicit parentheses add some internal structure to a sequence of numbers and operators. They tell us how subexpressions are grouped together – which are sums, and which are products. Let us number the positions in the text t representing the formula:

t = ₀ 1 ₁ + ₂ 2 ₃ ∗ ₄ 3 ₅ ∗ ₆ 4 ₇ + ₈ 5 ₉    (1)

such that we can refer to substrings by index pairs: t(0, 9) is the complete string t, and t(2, 5) is 2 ∗ 3.
A substring t(i, j) that forms an expression can, in general, be evaluated in many different ways, and we shall record the best value for t(i, j) in a table entry T(i, j). Since addition and multiplication are strictly monotone functions on positive numbers, an overall value (x + y) or (x ∗ y) can only be maximal if both subexpressions x and y are evaluated to their maximal values. So it is in fact sufficient to record the maximum in each entry. This is our first use of Bellman's Principle, to be formalized later. More precisely, we define

T(i, i + 1) = n,  if t(i, i + 1) = n                                      (2)
T(i, j) = max { T(i, k) ⊗ T(k + 1, j) | i < k < j, t(k, k + 1) = ⊗ }      (3)

where ⊗ is either + or ∗. Beginning with the shortest subwords of t, we can compute successively all defined table entries. In T(0, 9) we obtain the maximal possible value overall. If, together with T(i, j), we also record the position k within (i, j) that leads to the optimal value, then we can reconstruct the reading of the formula that yields the optimal value.

It is clear that El Mamun's minimization problem is solved by simply replacing max by min. Figure 1 gives the results for maximization and minimization of El Mamun's bill.

 i\j     1        3        5        7         9
  0    (1,1)    (3,3)    (7,9)   (25,36)   (30,81)
  2             (2,2)    (6,6)   (24,24)   (29,54)
  4                      (3,3)   (12,12)   (17,27)
  6                               (4,4)     (9,9)
  8                                         (5,5)

Fig. 1. Results for the maximization and minimization of El Mamun's bill, denoted as tuples (x, y) where x is the minimal value and y the maximal value; only the defined entries T(i, j) are shown.

Note that we have left open a few technical points: We have not provided explicit code to compute the table T, which is actually triangular, since i is always smaller than j. Such code has to deal with the fact that an entry remains undefined when t(i, j) is not a syntactically valid expression, like t(1, 4) = "+ 2 ∗".
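Such code can be sketched directly from recurrences (2) and (3). The following Python transcription (a sketch, with names of our own choosing) leaves invalid substrings as None; the same function solves the minimization problem when min is passed instead of max.

```python
import operator

OPS = {'+': operator.add, '*': operator.mul}

def el_mamun(tokens, choice):
    """Recurrences (2)-(3): T[i][j] holds the best value of t(i, j),
    or None when t(i, j) is not a valid expression."""
    n = len(tokens)
    T = [[None] * (n + 1) for _ in range(n + 1)]
    for i in range(n):                        # recurrence (2): single numbers
        if tokens[i] not in OPS:
            T[i][i + 1] = int(tokens[i])
    for length in range(3, n + 1, 2):         # shortest subwords of t first
        for i in range(n - length + 1):
            j = i + length
            vals = [OPS[tokens[k]](T[i][k], T[k + 1][j])
                    for k in range(i + 1, j)
                    if tokens[k] in OPS
                    and T[i][k] is not None and T[k + 1][j] is not None]
            if vals:
                T[i][j] = choice(vals)        # recurrence (3), with max or min
    return T[0][n]

bill = list("1+2*3*4+5")
print(el_mamun(bill, min), el_mamun(bill, max))   # 30 81, as in Figure 1
```

Entries such as T(1, 4) for "+ 2 *" simply stay None, since no splitting position k yields two defined subexpressions.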
In fact, there are about as many undefined entries as there are defined ones, and we may call this a case of sparse dynamic programming and search for a more clever form of tabulation. Another open point is the possibility of malformed input, like the non-expression "1 + ∗ 2". The implementation shown later will take care of all these cases.

The first discovery Al Chwarizmi made was that there were 14 different ways to evaluate the bill. In Section 4 we will see that the solution for this problem closely follows the recurrences just developed, except that there is no maximization or minimization involved. This is a combinatorial counting problem. Although DP is commonly associated with optimization problems, we will see that its scope is actually wider.

2.2 Matrix chain multiplication

A classical dynamic programming example is the matrix chain multiplication problem [CLR90]. Given a chain of matrices A1, ..., An, find an optimal placement of parentheses for computing the product A1 ∗ ... ∗ An. Since matrix multiplication is associative, the placement of parentheses does not affect the final value. However, a good choice can dramatically reduce the number of scalar multiplications needed. Consider three matrices A1, A2, A3 with dimensions 10 × 100, 100 × 5 and 5 × 50. Multiplication of (A1 ∗ A2) ∗ A3 needs 10 ∗ 100 ∗ 5 + 10 ∗ 5 ∗ 50 = 7500 scalar multiplications, in contrast to 10 ∗ 100 ∗ 50 + 100 ∗ 5 ∗ 50 = 75000 when choosing A1 ∗ (A2 ∗ A3).

Let M be an n × n table. Table entry M(i, j) shall hold the minimal number of multiplications needed to calculate Ai ∗ ... ∗ Aj. Compared to the previous example, the construction of the search space is considerably easier here since it does not depend on the structure of the input sequence but only on its length. M(i, j) = 0 for i = j. In any other case there exist j − i possible splittings of the matrix chain Ai, ..., Aj into two parts (Ai, ..., Ak) and (Ak+1, ..., Aj).
Let (ri, ci) be the dimension of matrix Ai, where ci = ri+1 for 1 ≤ i < n. Multiplying the two partial product matrices requires ri · ck · cj operations. Again we observe Bellman's Principle. Only if the partial products have been arranged internally in an optimal fashion, can this product minimize scalar multiplications overall. We order table calculation by increasing subchain length, such that we can look up all the M(i, k) and M(k + 1, j) when needed for computing M(i, j). This leads to the following matrix recurrence:

for j = 1 to n do                                                           (4)
    for i = j downto 1 do
        M(i, j) = 0                                               for i = j
        M(i, j) = min { M(i, k) + M(k + 1, j) + ri · ck · cj | i ≤ k < j }  for i < j
return M(1, n)                                                              (5)

Minimization over all possible splittings gives the optimal value for M(i, j). This example demonstrates that dynamic programming over sequence data is not necessarily limited to (character) strings, but can also be used with sequences of other types, in this case pairs of numeric values denoting matrix dimensions.

2.3 Global and local similarity of strings

We continue our series of examples by looking at the comparison of strings. The measurement of similarity or distance of two strings is an important operation applied in several fields, for example spelling correction, textual database retrieval, speech recognition, coding theory, or molecular biology.

A common formalization is the string edit model [Gus97]. We measure the similarity of two strings by scoring the different sequences of character deletions (denoted by D), character insertions (denoted by I) and character replacements (denoted by R) that transform one string into the other. If a character is unchanged, we formally model this as a replacement by itself. Thus, an edit operation is applied at each position. Figure 2 shows some possibilities to transform the string MISSISSIPPI into the string SASSAFRAS, visualized as an alignment.
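As an aside, the matrix chain recurrence (4)-(5) above also transcribes into a short Python sketch (illustrative names; the 0-based (rows, cols) representation of dimensions is our own choice):

```python
def matrix_chain(dims):
    """Minimal sketch of recurrence (4)-(5) for matrix chain multiplication.

    dims[i] is the (rows, cols) pair of matrix A_{i+1}; consecutive
    matrices are assumed compatible (cols of A_i = rows of A_{i+1}).
    Returns the minimal number of scalar multiplications overall.
    """
    n = len(dims)
    M = [[0] * n for _ in range(n)]     # M[i][j]: best cost for A_{i+1}..A_{j+1}
    for length in range(2, n + 1):      # by increasing subchain length
        for i in range(n - length + 1):
            j = i + length - 1
            # cost of multiplying the two partial products: r_i * c_k * c_j
            M[i][j] = min(M[i][k] + M[k + 1][j]
                          + dims[i][0] * dims[k][1] * dims[j][1]
                          for k in range(i, j))
    return M[0][n - 1]

print(matrix_chain([(10, 100), (100, 5), (5, 50)]))   # 7500, as in the text
```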
A similarity scoring function δ associates a similarity score of 0 with two empty strings, a positive score with two characters that are considered similar, and a negative score with two characters that are considered dissimilar. Insertions and deletions also receive negative scores.

MISSI--SSIPPI
SASSAFRAS----
RR  RIIR DDDD

Fig. 2. Three out of many possible ways to transform the string MISSISSIPPI into the string SASSAFRAS, visualized as alignments; one of them is shown above. Only deletions (D), insertions (I), and proper replacements (R) are marked.

For strings x of length m and y of length n, we compute the similarity matrix E such that E(i, j) holds the similarity score for the prefixes x1, ..., xi and y1, ..., yj. E(m, n) therefore holds the overall similarity value of x and y. E is calculated by the following recurrences:

E(0, 0) = 0                                                  (6)
for i = 0 to m − 1 do
    E(i + 1, 0) = E(i, 0) + δ(D(x_{i+1}))                    (7)
for j = 0 to n − 1 do
    E(0, j + 1) = E(0, j) + δ(I(y_{j+1}))                    (8)
for i = 0 to m − 1 do
    for j = 0 to n − 1 do
        E(i + 1, j + 1) = max( E(i, j + 1) + δ(D(x_{i+1})),
                               E(i + 1, j) + δ(I(y_{j+1})),
                               E(i, j) + δ(R(x_{i+1}, y_{j+1})) )    (9)
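Recurrences (6)-(9) translate directly into Python. The concrete scoring below (match +1, mismatch and gaps −1) is only an assumed example standing in for the generic scoring function δ of the text:

```python
def similarity(x, y, delta_rep, gap):
    """Global similarity per recurrences (6)-(9).

    delta_rep(a, b) scores replacing a by b (positive if similar);
    gap is the (negative) score of one insertion or deletion.
    Both are assumptions standing in for the generic delta.
    """
    m, n = len(x), len(y)
    E = [[0] * (n + 1) for _ in range(m + 1)]     # recurrence (6): E[0][0] = 0
    for i in range(m):                            # recurrence (7): deletions
        E[i + 1][0] = E[i][0] + gap
    for j in range(n):                            # recurrence (8): insertions
        E[0][j + 1] = E[0][j] + gap
    for i in range(m):                            # recurrence (9)
        for j in range(n):
            E[i + 1][j + 1] = max(
                E[i][j + 1] + gap,                    # delete x[i]
                E[i + 1][j] + gap,                    # insert y[j]
                E[i][j] + delta_rep(x[i], y[j]))      # replace x[i] by y[j]
    return E[m][n]

score = similarity("MISSISSIPPI", "SASSAFRAS",
                   lambda a, b: 1 if a == b else -1, -1)
print(score)
```

Tracing back through the maximizing choices in E recovers an optimal alignment, just as recording the splitting position k did for El Mamun's bill.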
Journal: Sci. Comput. Program.
Volume 51, Issue -
Pages: -
Published: 2004